The availability of satellite-based atmospheric column observations are limited by the surface and meteorological conditions of the sensing date, and therefore missing values (pixels) exist in the dataset. Before using these satellite-derived observations as input variables to model the ground-level NO2 concentration, the missing pixels should be first imputed. In principle, the method is to model the relationship between the satellite atmospheric column observation and some covariates (which do not come with missing values) using the pixel-days without missing values, and to implement this model to predict the missing pixels in the satellite-based atmospheric column observation datasets. This document demonstrates the development and implementation of the imputation model for the OMI-NO2 product.

library(stars) ; library(sf) ; library(raster)
library(dplyr) ; library(tidyr) 
library(ggplot2) ; library(ggsci)
library(lubridate) ; library(stringr)
library(randomForest) ; library(ranger)

1 Data preparation

Based on literature reviews, the following variables are selected to model the missing OMI NO2 observations. * CAMS Global Reanalysis data (modeled-NO2 and NO) at various time (00, 03, 06, 09, 12, 15, 18, 21)
* ERA5 meteorological variables at various time (00, 03, 06, 09, 12, 15, 18, 21)
* temperature at 2m, surface pressure, total precipitation, boundary layer height, total cloud cover, wind speed and wind direction (calculated from u and v wind components)
* altitude (EU-DEM v1.1)
* coordinates of the cells: x, y
* Day Of Year

1.1 Import datasets

1.1.1 OMI data

# file paths
files_OMI <- list.files("1_data/raw/OMI-NO2" , "_AOI.tif$" , full.names = TRUE)
# timestamps
ts_OMI_raw <- basename(files_OMI) %>% 
  str_extract("\\d{8}") %>% 
  as_date()
# read as a 3D array: x, y, date
OMI_raw <- read_stars(files_OMI , along = list(date = ts_OMI_raw))  %>% 
  setNames("value")
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##      value           
##  Min.   :-9.982e+14  
##  1st Qu.: 4.562e+14  
##  Median : 1.389e+15  
##  Mean   : 1.968e+15  
##  3rd Qu.: 2.728e+15  
##  Max.   : 5.517e+16  
##  NA's   :83574       
## dimension(s):
##      from  to     offset  delta                       refsys point values x/y
## x       1  24       5.25   0.25 GEOGCS["unnamed ellipse",... FALSE   NULL [x]
## y       1  13      45.25   0.25 GEOGCS["unnamed ellipse",... FALSE   NULL [y]
## date    1 365 2019-01-01 1 days                         Date    NA   NULL

The dataset is a 3D spatialtemporal array datacube (stars).

1.1.2 Predictor variables

1.1.2.1 CAMS

# 3D array: x, y, time 
CAMS_raw <- read_stars("1_data/raw/ECMWF-CAMS-NO2/CAMS-NO2.nc") %>%
  st_set_crs(st_crs(4326)) # long-lat coordinates

1.1.2.2 Elevation

# 2D array: x, y
DEM_raw <- read_stars("1_data/raw/EU-DEM-1.1/eu_dem_v11_E40N20_AOI.tif")

1.1.2.3 ERA5 meteorological variables

# 3D array: x, y, time 
ERA_raw <- read_stars("1_data/raw/ECMWF-ERA5/ERA5-meteorology.nc") %>% 
  st_set_crs(st_crs(4326)) # long-lat coordinates

1.1.3 Area of interest (boundary)

CH <- st_read("1_data/raw/Switzerland_shapefile/CHE_adm0.shp") %>% 
  st_geometry() %>% 
  st_simplify(preserveTopology = TRUE , dTolerance = 0.05)
AOI <- st_read("1_data/raw/Switzerland_shapefile/AOI_4326.shp") 

1.2 Resample and reshape spatialtemporal- and spatial datasets

To do the calculation and modeling based on spatially and/or temporally co-locating pixels of the various predictor variables, all of the datasets need to have the same spatial dimensions (x,y aka spatial resolution and coordinate systems) and/or temporal dimension (daily).
For the imputation model of OMI-NO2, all the predictor variables are resampled and reshaped to the dimension of OMI, which is:

  • spatial resolution: 0.25˚*0.25˚
  • temporal resolution: daily * 365 days

Here are the resampling methods applied at this step:

  • Continuous data
    • Upscaling (finer grids to lower grids): bilinear interpolation resampling
    • Downscaling (coarser grids to finer grids): nearest neighbor resampling (a conservative transformation of the raw values without making assumptions)

1.2.1 CAMS

Downscaling (0.75˚ to 0.25˚) with nearest neighbor resampling

CAMS_rsNN <- CAMS_raw %>%
  st_warp(dest = OMI_raw)

Separate the resampled CAMS data into sub-datasets(8*2=16) to have the same dimension as OMI (3-hour to daily)

subset.CAMS_hour <- CAMS_rsNN %>%
  st_get_dimension_values(3) %>%
  hour() %>%
  unique()
# 2 attributes: "tcno2" "tc_no"
subset.CAMS_var <- CAMS_rsNN %>% names()
# subset into 16 3D arrays
for(h in subset.CAMS_hour){
  for(v in subset.CAMS_var){
    # subset the raw dataset: at hour h and for variable (attribute) v
    CAMS_rs_temp <- CAMS_rsNN %>%
      filter(hour(time) == h) %>%
      select(all_of(v)) %>%
      setNames(v) %>%
      # dimension: date
      st_set_dimensions(3 ,
                        values = as_date(st_get_dimension_values(. , 3)) ,
                        names = "date")
    # name the object name as CAMS_rs_{Variable}_{Hour}
    assign(paste0("CAMS_rs_" , v , "_" , h) , CAMS_rs_temp)
    rm(CAMS_rs_temp)
  }
}
# clean intermediate objects of the loop
rm(h , v , subset.CAMS_hour , subset.CAMS_var)

Output: CAMS_rs_{Variable}_{Hour} with

  • {Variable}: tc_no, tcno2
  • {Hour}: 0, 3, 6, 9, 12, 15, 18, 21 ;

and each of these as a 3D array. Here’s an example comaring the original and resampled CAMS data on a random date:

1.2.2 Elevation

Bilinear interpolation

# 25*25m -> 7*13km upscaling for continuous data
# bilinear interpolation
DEM_rs <- DEM_raw %>% 
  st_warp(OMI_raw[,,,1] %>% split("date") , 
          use_gdal = TRUE , method = "bilinear") %>% 
  setNames("DEM")

1.2.3 ERA5 meteorological variables

Nearest-neighbor resampling

ERA_rsNN <- ERA_raw %>% 
  st_warp(dest = OMI_raw) 

Separate the resampled ERAA5 data into sub-datasets(8*2=16) to have the same dimension as OMI (3-hour to daily)

subset.ERA_hour <- ERA_rsNN %>% 
  st_get_dimension_values(3) %>% 
  hour() %>% 
  unique()
# 7 attributes: "u10" "v10" "t2m" "blh" "sp"  "tcc" "tp" 
subset.ERA_var <- ERA_rsNN %>% names()
# subset into 16 3D arrays
for(h in subset.ERA_hour){
  for(v in subset.ERA_var){
    # subset the raw dataset: at hour h and for variable (attribute) v
    ERA_rs_temp <- ERA_rsNN %>% 
      filter(hour(time) == h) %>% 
      select(all_of(v)) %>% 
      setNames(v) %>% 
      # # dimension: date
      st_set_dimensions(3 , 
                        values = as_date(st_get_dimension_values(. , 3)) , 
                        names = "date")
    # name the object name as ERA_rs_{Variable}_{Hour}
    assign(paste0("ERA_rs_" , v , "_" , h) , ERA_rs_temp)
    rm(ERA_rs_temp)
  }
}
# clean intermediate objects of the loop
rm(h , v , subset.ERA_hour , subset.ERA_var)

Output: ERA_rs_{Variable}_{Hour} with

  • {Variable}: u10, v10, t2m, blh, sp, tcc, tp
  • {Hour}: 0, 3, 6, 9, 12, 15, 18, 21 ;

and each of these as a 3D array.

1.2.3.1 Convert u,v wind components to wind speed, wind direction

Conversion formula:

  • \(wd = atan2(-u , -v)\)
  • \(ws = \sqrt{u^2 + v^2}\)
subset.ERA_hour <- ERA_raw %>% 
  st_get_dimension_values(3) %>% 
  hour() %>% 
  unique()
for(h in subset.ERA_hour){
  ERA_rs_wind_temp <- c(
    get(paste0("ERA_rs_u10_" , h)) , 
    get(paste0("ERA_rs_v10_" , h))
  ) %>% 
    # wd = atan2(-u10,-v10)
    # ws = sqrt(u10^2 + v10^2)
    transmute(wd = atan2(-u10 , -v10) , 
              ws = sqrt(u10^2+v10^2))
  # name the object name as CAMS_rs_wind_{Hour}
  assign(paste0("ERA_rs_wind_" , h) , ERA_rs_wind_temp)
}
# clean
rm(h , subset.ERA_hour , ERA_rs_wind_temp)

1.3 Cleaned input dataset for the model

# spatialtemporal datasets: 3D (x,y,date) + attributes --------------------------------
spatialtemporal <- c(
  # OMI 
  OMI_raw %>% 
    setNames("OMI_NO2") , 
  # CAMS NO2
  c(CAMS_rs_tcno2_0 , CAMS_rs_tcno2_3 , CAMS_rs_tcno2_6 , CAMS_rs_tcno2_9 , 
    CAMS_rs_tcno2_12 , CAMS_rs_tcno2_15 , CAMS_rs_tcno2_18 , CAMS_rs_tcno2_21) %>% 
    setNames(c("CAMS_NO2_00" , "CAMS_NO2_03" , "CAMS_NO2_06" , "CAMS_NO2_09" , 
               "CAMS_NO2_12" , "CAMS_NO2_15" , "CAMS_NO2_18" , "CAMS_NO2_21")) , 
  # CAMS NO
  c(CAMS_rs_tc_no_0 , CAMS_rs_tc_no_3 , CAMS_rs_tc_no_6 , CAMS_rs_tc_no_9 , 
    CAMS_rs_tc_no_12 , CAMS_rs_tc_no_15 , CAMS_rs_tc_no_18 , CAMS_rs_tc_no_21) %>% 
    setNames(c("CAMS_NO_00" , "CAMS_NO_03" , "CAMS_NO_06" , "CAMS_NO_09" , 
               "CAMS_NO_12" , "CAMS_NO_15" , "CAMS_NO_18" , "CAMS_NO_21")) , 
  # ERA blh
  c(ERA_rs_blh_0 , ERA_rs_blh_3 , ERA_rs_blh_6 , ERA_rs_blh_9 ,
    ERA_rs_blh_12 , ERA_rs_blh_15 , ERA_rs_blh_18 , ERA_rs_blh_21) %>%
    setNames(c("blh_00" , "blh_03" , "blh_06" , "blh_09" ,
               "blh_12" , "blh_15" , "blh_18" , "blh_21")) , 
  # ERA temperature
  c(ERA_rs_t2m_0 , ERA_rs_t2m_3 , ERA_rs_t2m_6 , ERA_rs_t2m_9 , 
    ERA_rs_t2m_12 , ERA_rs_t2m_15 , ERA_rs_t2m_18 , ERA_rs_t2m_21) %>% 
    setNames(c("t2m_00" , "t2m_03" , "t2m_06" , "t2m_09" , 
               "t2m_12" , "t2m_15" , "t2m_18" , "t2m_21")) , 
  # ERA surface pressure
  c(ERA_rs_sp_0 , ERA_rs_sp_3 , ERA_rs_sp_6 , ERA_rs_sp_9 , 
    ERA_rs_sp_12 , ERA_rs_sp_15 , ERA_rs_sp_18 , ERA_rs_sp_21) %>% 
    setNames(c("sp_00" , "sp_03" , "sp_06" , "sp_09" , 
               "sp_12" , "sp_15" , "sp_18" , "sp_21")) , 
  # ERA total cloud cover
  c(ERA_rs_tcc_0 , ERA_rs_tcc_3 , ERA_rs_tcc_6 , ERA_rs_tcc_9 , 
    ERA_rs_tcc_12 , ERA_rs_tcc_15 , ERA_rs_tcc_18 , ERA_rs_tcc_21) %>% 
    setNames(c("tcc_00" , "tcc_03" , "tcc_06" , "tcc_09" , 
               "tcc_12" , "tcc_15" , "tcc_18" , "tcc_21")) ,
  # ERA total precipitation
  c(ERA_rs_tp_0 , ERA_rs_tp_3 , ERA_rs_tp_6 , ERA_rs_tp_9 , 
    ERA_rs_tp_12 , ERA_rs_tp_15 , ERA_rs_tp_18 , ERA_rs_tp_21) %>% 
    setNames(c("tp_00" , "tp_03" , "tp_06" , "tp_09" , 
               "tp_12" , "tp_15" , "tp_18" , "tp_21")) , 
  # ERA wind
  c(ERA_rs_wind_0 , ERA_rs_wind_3 , ERA_rs_wind_6 , ERA_rs_wind_9 , 
    ERA_rs_wind_12 , ERA_rs_wind_15 , ERA_rs_wind_18 , ERA_rs_wind_21) %>% 
    setNames(c("wd_00", "ws_00", "wd_03", "ws_03", "wd_06", "ws_06", "wd_09", "ws_09",
               "wd_12", "ws_12", "wd_15", "ws_15", "wd_18", "ws_18", "wd_21", "ws_21"))
) %>% 
  st_set_dimensions(3 , values = yday(st_get_dimension_values(. , 3)) , names = "DOY")

# spatial datasets: 2D (x,y) + attributes --------------------------------
spatial <- c(
  # DEM
  DEM_rs
)

After reshaping the data, spatialtemporal is a 3D array (x,y,day) and spatial is a 2D array (x,y). Both matches the resolution of OMI.

Covert the spatialtemporal and spatial arrays to data.frame for the model:

imputation_df <- spatialtemporal %>% 
  as.data.frame() %>% 
  mutate(across(everything() , as.numeric)) %>% 
  left_join(spatial %>% as.data.frame() , 
            by = c("x" , "y")) %>% 
  # Some pixels are outside of the AOI and are always NA, 
  # should be removed from the data.frame so that it wouldn’t be counted.
  full_join(
    # AOI mask in raster form
    AOI %>% 
      st_rasterize(template = spatial) %>% 
      # is.AOI: a column with TRUE/FALSE 
      mutate(is.AOI = ifelse(is.na(FID) , FALSE , TRUE)) %>% 
      as.data.frame(), 
    by = c("x" , "y")
  ) %>% 
  # remove the pixels that are outside AOI (is.AOI=FALSE)
  filter(is.AOI) %>% 
  # remove the FID and is.AOI columns from the AOI mask
  select(-FID , -is.AOI)

1.4 Spatially-blocked 10-fold cross validation

The area of interest is divided into large grids and randomly assigned with IDs (1-10) to apply a spatially-blocked cross validation.

# spatially blocked k-fold cross validation
k_fold <- 10

# spatially-blocked by grids across AOI
set.seed(2201)
# make grids and assign cross validation index
spatialGrid_CV <- AOI %>% 
  # reporject to OMI
  st_transform(st_crs(spatial)) %>% 
  # make grids
  st_make_grid(n = c(10,4)) %>% 
  st_as_sf() %>% 
  # randomly assign cross validation index
  mutate(spatial_CV = sample(1:k_fold , nrow(.) , replace = TRUE))
# rasterize the grids with CV-index to stars
spatialGrid_CV_stars <- st_rasterize(spatialGrid_CV["spatial_CV"], template = spatial) %>% 
  mutate(spatial_CV = as.factor(spatial_CV))
ggplot() +
  # spatial-CV
  geom_stars(data = spatialGrid_CV_stars) +
  # OMI grids
  geom_sf(data = spatial %>% 
            # convert the grids to polygon to visualize the grid cells
            st_as_sf(as_points = FALSE) ,  
          color = "azure1" , fill = NA , alpha = 0.5 , size = 0.2) +
  # Switzerland
  geom_sf(data = CH , fill = NA , color = "white") +
  # AOI
  geom_sf(data = AOI , fill = NA , color = "azure2") +
  scale_fill_npg() +
  coord_sf(expand = FALSE) +
  labs(x = "Longtitude" , y = "Latitude" , fill = "k-fold \ncross \nvalidation \ngroup")
## Warning: Removed 24 rows containing missing values (geom_raster).

1.4.1 Include the spatial CV design into the input data

imputation_df <- spatialGrid_CV_stars %>% 
  setNames("spatial_CV") %>% 
  as.data.frame() %>% 
  # join to the input data
  right_join(imputation_df , by = c("x" , "y")) 

Take a look at the full data imputation_df:

str(imputation_df)
## 'data.frame':    73000 obs. of  78 variables:
##  $ x          : num  9.12 9.12 9.12 9.12 9.12 ...
##  $ y          : num  45.4 45.4 45.4 45.4 45.4 ...
##  $ spatial_CV : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ DOY        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ OMI_NO2    : num  NA 6.11e+15 NA 1.77e+16 NA ...
##  $ CAMS_NO2_00: num  2.89e-05 9.26e-06 3.48e-06 5.33e-06 1.81e-05 ...
##  $ CAMS_NO2_03: num  2.49e-05 9.35e-06 3.46e-06 5.13e-06 1.82e-05 ...
##  $ CAMS_NO2_06: num  2.41e-05 7.98e-06 3.71e-06 5.45e-06 1.89e-05 ...
##  $ CAMS_NO2_09: num  1.80e-05 4.75e-06 3.27e-06 6.12e-06 1.76e-05 ...
##  $ CAMS_NO2_12: num  1.58e-05 3.48e-06 3.16e-06 9.17e-06 1.03e-05 ...
##  $ CAMS_NO2_15: num  2.01e-05 3.62e-06 5.32e-06 1.42e-05 9.38e-06 ...
##  $ CAMS_NO2_18: num  2.41e-05 4.72e-06 6.32e-06 1.94e-05 8.57e-06 ...
##  $ CAMS_NO2_21: num  2.30e-05 4.37e-06 5.66e-06 2.33e-05 6.81e-06 ...
##  $ CAMS_NO_00 : num  2.79e-06 9.56e-07 6.69e-09 9.69e-07 1.59e-06 ...
##  $ CAMS_NO_03 : num  3.43e-06 1.27e-06 2.05e-08 1.55e-06 2.02e-06 ...
##  $ CAMS_NO_06 : num  3.27e-06 9.71e-07 8.43e-08 2.02e-06 2.31e-06 ...
##  $ CAMS_NO_09 : num  7.98e-06 1.98e-06 1.02e-06 3.73e-06 4.73e-06 ...
##  $ CAMS_NO_12 : num  7.33e-06 1.24e-06 1.33e-06 4.88e-06 4.83e-06 ...
##  $ CAMS_NO_15 : num  3.43e-06 6.54e-07 8.91e-07 2.45e-06 2.60e-06 ...
##  $ CAMS_NO_18 : num  3.51e-07 5.78e-09 2.92e-07 7.32e-07 4.02e-07 ...
##  $ CAMS_NO_21 : num  7.34e-07 6.69e-09 6.70e-07 1.27e-06 4.76e-07 ...
##  $ blh_00     : num  48.8 155.8 452.9 44.8 36.3 ...
##  $ blh_03     : num  74.2 124.7 170.7 25.1 26 ...
##  $ blh_06     : num  40.5 125.1 93.1 20.2 27.3 ...
##  $ blh_09     : num  35.6 1009.3 81.7 45.5 69.2 ...
##  $ blh_12     : num  342 1728 420 292 267 ...
##  $ blh_15     : num  307.9 1424.4 171.9 47.9 198.9 ...
##  $ blh_18     : num  112.9 1015.8 38.1 17.7 94.1 ...
##  $ blh_21     : num  77.9 1110.2 45 18.1 56.3 ...
##  $ t2m_00     : num  276 275 277 274 271 ...
##  $ t2m_03     : num  274 276 276 271 271 ...
##  $ t2m_06     : num  275 275 273 269 272 ...
##  $ t2m_09     : num  276 282 276 272 274 ...
##  $ t2m_12     : num  279 284 281 279 279 ...
##  $ t2m_15     : num  280 283 280 279 282 ...
##  $ t2m_18     : num  276 281 275 275 277 ...
##  $ t2m_21     : num  275 279 273 273 274 ...
##  $ sp_00      : num  102170 101056 101800 102031 101822 ...
##  $ sp_03      : num  102257 100961 101845 102114 101764 ...
##  $ sp_06      : num  102280 101022 101850 102089 101642 ...
##  $ sp_09      : num  102324 101401 101975 102250 101580 ...
##  $ sp_12      : num  102100 101372 101850 102197 101246 ...
##  $ sp_15      : num  101777 101572 101821 102132 101055 ...
##  $ sp_18      : num  101581 101716 101912 102056 101112 ...
##  $ sp_21      : num  101304 101782 102005 102002 101242 ...
##  $ tcc_00     : num  3.09e-01 2.85e-01 5.55e-17 5.55e-17 3.75e-03 ...
##  $ tcc_03     : num  2.55e-01 7.43e-01 5.55e-17 2.39e-01 3.90e-01 ...
##  $ tcc_06     : num  4.69e-01 5.03e-01 5.55e-17 1.20e-01 9.75e-01 ...
##  $ tcc_09     : num  0.7202 0.0468 0.1694 0.1446 0.7507 ...
##  $ tcc_12     : num  6.49e-01 5.55e-17 3.27e-02 3.67e-01 1.40e-03 ...
##  $ tcc_15     : num  4.74e-01 5.55e-17 1.29e-02 6.26e-01 5.55e-17 ...
##  $ tcc_18     : num  3.41e-01 5.55e-17 5.55e-17 6.96e-01 5.55e-17 ...
##  $ tcc_21     : num  3.52e-01 5.55e-17 6.10e-05 2.36e-01 5.55e-17 ...
##  $ tp_00      : num  -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ tp_03      : num  -8.67e-19 4.17e-07 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ tp_06      : num  -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ tp_09      : num  -8.67e-19 7.91e-06 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ tp_12      : num  8.33e-07 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ tp_15      : num  -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ tp_18      : num  -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ tp_21      : num  4.17e-07 -8.67e-19 -8.67e-19 -8.67e-19 -8.67e-19 ...
##  $ wd_00      : num  1.328 -1.811 -0.335 2.696 -2.096 ...
##  $ ws_00      : num  2.1 4.11 4.12 2.75 1.17 ...
##  $ wd_03      : num  1.273 -1.699 -0.489 -0.25 -2.537 ...
##  $ ws_03      : num  2.46 3.8 3.29 1.45 1.68 ...
##  $ wd_06      : num  1.028 -0.917 -0.661 -2.219 -2.383 ...
##  $ ws_06      : num  2.241 1.944 2.643 0.992 1.157 ...
##  $ wd_09      : num  -0.949 0.38 -1.242 -2.419 -2.109 ...
##  $ ws_09      : num  0.264 4.674 1.251 0.546 1.918 ...
##  $ wd_12      : num  -1.977 0.138 -2.16 -1.924 -1.958 ...
##  $ ws_12      : num  1.17 7.22 2.04 1.54 2.49 ...
##  $ wd_15      : num  -1.362 0.289 -2.803 -1.933 -1.883 ...
##  $ ws_15      : num  1.4 5.14 1.47 1.11 3.84 ...
##  $ wd_18      : num  -1.526 0.199 -1.857 -2.821 -1.819 ...
##  $ ws_18      : num  2.31 4.95 2.11 1.03 3.75 ...
##  $ wd_21      : num  -1.5436 -0.0215 -0.6 -2.3604 -1.4804 ...
##  $ ws_21      : num  2.37 4.65 2.13 1.22 2.13 ...
##  $ DEM        : num  128 128 128 128 128 ...

2 Model development

Random forest is used to model the relationship between \(OMI-NO_2\) and the covariates. The first step is to optimize the predictor variable set that is included in the model.

2.1 Grid-search: optimized predictor variable set by hour and variables

For the CAMS-Global-Reanalysis data and the ERA5 data, we included the data for the whole day with 3-hour interval at the first place (0H, 3H, 6H, 9H, 12H, 15H, 18H, 21H) and let the model decide which one to use, based on model performance indices like \(R^2\). We used a grid search to try the models with data of different timesteps and compare the OOB- and CV-\(R^2\). Additionally, we looked at different combination (subsets) of the predictor variables. (Preliminary tryings, which include purely-random values in the predictor variables, suggest that total precipitation tp is as useless as random values to the response variable by looking at the variable importance output of the random forest model.) The three predictor variable combination sets are

  • base: CAMS-NO2, CAMS-NO, DOY, altitude, x, y
  • base_meteo6: base + temp, blh, tcc, sp, ws, wd
  • base_meteo7: base + temp, blh, tcc, sp, ws, wd, tp
# all combinations
search_hour <- c("00" , "03" , "06" , "09" , "12" , "15" , "18" , "21")
search_var <- list(
  # base model
  base = c("x" , "y" , "DOY" , "DEM" , "CAMS_NO2_" , "CAMS_NO_") , 
  # base + temperature, blh, tcc, sp, wd, ws
  base_meteo6 = c("x" , "y" , "DOY" , "DEM" , "CAMS_NO2_" , "CAMS_NO_" , "t2m_" , "blh_" , "tcc_" , "sp_" , "wd_" , "ws_") ,
  # base + all meteorological variables
  base_meteo7 = c("x" , "y" , "DOY" , "DEM" , "CAMS_NO2_" , "CAMS_NO_" , "t2m_" , "blh_" , "tcc_" , "sp_" , "wd_" , "ws_" , "tp_")
)
# grid search over every combination
N.trees <- 500
gridSearch_hour_var <- list()
gridSearch_pb <- txtProgressBar(min = 0, max = length(search_hour) * length(search_var) , style = 3)
for(h in 1:length(search_hour)){
  for(v in 1:length(search_var)){
    # progress bar
    setTxtProgressBar(gridSearch_pb , length(search_var) * (h-1) + v)
    result_list_hv <- list()
    # subset dataset for the predictor variable combination
    modelInput_df.cv <- imputation_df %>%
      # exclude rows with missing predictor/response value
      filter(!if_any(everything() , is.na)) %>%
      # only the hour in search_hour
      select(x , y , OMI_NO2 , DOY , DEM , ends_with(search_hour[h]) , spatial_CV) %>%
      # only the variables in search_var
      select(OMI_NO2 ,contains(search_var[[v]]) , spatial_CV)
    # model with full training data
    rf_temp <- ranger(OMI_NO2 ~ . ,
                      data = modelInput_df.cv %>% select(-spatial_CV) ,
                      importance = "impurity" ,
                      num.trees = N.trees , 
                      keep.inbag = TRUE)
    # store OOB-R2 of the model
    result_list_hv$OOB_R2 <- rf_temp$r.squared
    # store variable importance of the model
    result_list_hv$varImportance <- sort(rf_temp$variable.importance , decreasing = TRUE)
    # cross-validation model
    result_list_hv$CV_R2_k <- c()
    for(k in 1:k_fold){
      rf_temp_cv <- ranger(OMI_NO2 ~ . ,
                           data = modelInput_df.cv %>% 
                             filter(spatial_CV != as.character(k)) %>% 
                             select(-spatial_CV) ,
                           importance = "impurity" ,
                           keep.inbag = TRUE)
      rf_temp_cv_pred <- predict(rf_temp_cv , 
                                 data = modelInput_df.cv %>% 
                                   filter(spatial_CV == as.character(k)) %>% 
                                   select(-spatial_CV))
      rf_temp_cv_obs_pred <- modelInput_df.cv %>% 
        filter(spatial_CV == as.character(k)) %>% 
        select(OMI_NO2) %>% 
        rename(obs = OMI_NO2) %>% 
        mutate(pred = rf_temp_cv_pred$predictions)
      result_list_hv$CV_R2_k[k] <- cor(rf_temp_cv_obs_pred$obs , rf_temp_cv_obs_pred$pred)^2
    }
    result_list_hv$CV_R2 <- mean(result_list_hv$CV_R2_k)
    # store results
    if(v == 1) gridSearch_hour_var[[h]] <- list()
    gridSearch_hour_var[[h]][[v]] <- result_list_hv
  }
  # set names in the list
  names(gridSearch_hour_var[[h]]) <- names(search_var)
}
rm(h,v,k)
names(gridSearch_hour_var) <- paste0("HOUR_" , search_hour)

2.1.1 Result

gridSearch_OOB_R2 <- gridSearch_hour_var %>% 
  unlist() %>% 
  as.data.frame() %>% 
  rename(.data = . , value = `.`) %>% 
  mutate(var = rownames(.)) %>% 
  filter(grepl("OOB_R2$" , var)) %>% 
  rename(OOB_R2 = value) %>% 
  separate(col = var , into = c("Hour" , "model" , "var") , sep = "\\.") %>% 
  select(-var) %>% 
  mutate(Hour = gsub("HOUR_" , "" , Hour)) %>% 
  as_tibble()
gridSearch_CV_R2 <- gridSearch_hour_var %>% 
  unlist() %>% 
  as.data.frame() %>% 
  rename(.data = . , value = `.`) %>% 
  mutate(var = rownames(.)) %>% 
  filter(grepl("CV_R2$" , var)) %>% 
  rename(CV_R2 = value) %>% 
  separate(col = var , into = c("Hour" , "model" , "var") , sep = "\\.") %>% 
  select(-var) %>% 
  mutate(Hour = gsub("HOUR_" , "" , Hour)) %>% 
  as_tibble()
cowplot::plot_grid(
  # visualization: OOB-R2
  gridSearch_OOB_R2 %>% 
    ggplot(aes(x = Hour , y = model , fill = OOB_R2)) + 
    geom_raster() +
    geom_text(aes(label = round(OOB_R2 , 3)) , color = "white" , size = 3) +
    coord_cartesian(expand = FALSE) +
    scale_fill_gradientn(colours = RColorBrewer::brewer.pal(5 , "YlGnBu")) +
    labs(x = "Hour (CAMS and ERA5)" , y = "Predictor variable set" , 
         fill = expression("OOB-R"^2)) , 
  # visualization: CV-R2
  gridSearch_CV_R2 %>% 
    ggplot(aes(x = Hour , y = model , fill = CV_R2)) + 
    geom_raster() +
    geom_text(aes(label = round(CV_R2 , 3)) , color = "white", size = 3) +
    coord_cartesian(expand = FALSE) +
    scale_fill_gradientn(colours = RColorBrewer::brewer.pal(7 , "BuGn")) +
    labs(x = "Hour (CAMS and ERA5)" , y = "Predictor variable set" , 
         fill = expression("CV-R"^2)) , 
  ncol = 2
)

From the result of the grid search over the hours, we can first observe that the models using CAMS- and ERA5-variables at 12H and 15H have the higher \(R^2\). Therefore in the next step we further look at different combination of predictor variables at 12H. However note that although the models including meteorological variables return higher OOB-\(R^2\) compared to base, it’s not the case when we look at CV-\(R^2\). The models with meteorological variables might be overfitting. Consequently in the next step we also compare base model with the other models with more predictor variables.

2.2 Grid-search: hour fixed, different subsets of variables

Here we first include all the predictor variables (12H for CAMS and ERA5) in a model (full model), and record the order of variable importance reported by the model. Then a grid search is applied where the i least important variables are removed at a time. Additionally, like mentioned above, the base model which does not consider any meteorological variables is also compared.

modelInput_df <- imputation_df %>%
  select(x , y , DOY , OMI_NO2 , DEM , ends_with("_12") , spatial_CV) %>%
  # exclude rows with missing predictor/response value
  filter(!if_any(everything() , is.na)) 

# grid search: exclude i least important variables at a time
# initial: all predictor variables
rf_initial <- ranger(OMI_NO2 ~ x + y + DOY + DEM + CAMS_NO2_12 + CAMS_NO_12 + 
                       blh_12 + t2m_12 + sp_12 + tcc_12 + tp_12 + wd_12 + ws_12  , 
                     data = modelInput_df ,
                     importance = "impurity" ,
                     keep.inbag = TRUE)
inputVar_ordered <- rf_initial$variable.importance %>% 
  sort(decreasing = TRUE) %>% 
  names()
rm(rf_initial)
# grid search 
gridSearch_12H <- list()
gridSearch_i <- 1:7
N.trees <- 500
for(i in gridSearch_i){
  gridSearch_12H[[i]] <- list()
  # exclude i-1 least importance variables (starting from initial aka excluding no variable)
  formula_rf.temp <- sprintf("OMI_NO2 ~ %s" , 
                             paste(inputVar_ordered[1:(length(inputVar_ordered)-i+1)] , collapse = '+')) %>% 
    formula() # making the formula object
  # also try: no ERA-5 at all
  if(i == max(gridSearch_i)) formula_rf.temp = OMI_NO2 ~ x + y + DOY + DEM + CAMS_NO2_12 + CAMS_NO_12
  gridSearch_12H[[i]]$formula <- formula_rf.temp
  # train model
  rf_temp <- ranger(formula_rf.temp , 
                    data = modelInput_df ,
                    importance = "impurity" ,
                    num.trees = N.trees , 
                    keep.inbag = TRUE)
  # store OOB-R2 result
  gridSearch_12H[[i]]$OOB_R2 <- rf_temp$r.squared
  # store importance
  gridSearch_12H[[i]]$varImportance <- rf_temp$variable.importance
  # cross validation
  gridSearch_12H[[i]]$CV_R2_k <- c()
  gridSearch_12H[[i]]$CV_slope_k <- c()
  gridSearch_12H[[i]]$CV_RMSE_k <- c()
  for(k in 1:k_fold){
    # model
    rf_temp_cv <- ranger(formula_rf.temp ,
                         data = modelInput_df %>% 
                           filter(spatial_CV != as.character(k)) ,
                         importance = "impurity" ,
                         keep.inbag = TRUE)
    # prediction on test set
    rf_temp_cv_pred <- predict(rf_temp_cv , 
                               data = modelInput_df %>% 
                                 filter(spatial_CV == as.character(k)) )
    rf_temp_cv_obs_pred <- modelInput_df %>% 
      filter(spatial_CV == as.character(k)) %>% 
      select(OMI_NO2) %>% 
      rename(obs = OMI_NO2) %>% 
      mutate(pred = rf_temp_cv_pred$predictions)
    # calculate CV-R2
    gridSearch_12H[[i]]$CV_R2_k[k] <- cor(rf_temp_cv_obs_pred$obs , rf_temp_cv_obs_pred$pred)^2
    # calculate CV-slope
    gridSearch_12H[[i]]$CV_slope_k[k] <- lm(obs ~ pred , data = rf_temp_cv_obs_pred)$coefficients[2]
    # calculate CV-RMSE
    gridSearch_12H[[i]]$CV_RMSE_k[k] <- rf_temp_cv_obs_pred %>% 
      summarize(RMSE = sqrt(sum(((pred-obs)^2)/n())) ) %>% 
      unlist
    # clean
    rm(rf_temp_cv , rf_temp_cv_pred , rf_temp_cv_obs_pred)
  }
  # result: mean CV-R2, mean slope, mean RMSE
  gridSearch_12H[[i]]$CV_R2 <- mean(gridSearch_12H[[i]]$CV_R2_k)
  gridSearch_12H[[i]]$CV_slope <- mean(gridSearch_12H[[i]]$CV_slope_k)
  gridSearch_12H[[i]]$CV_RMSE <- mean(gridSearch_12H[[i]]$CV_RMSE_k)
  # clean
  rm(formula_rf.temp , rf_temp)
}
rm(i,k)
names(gridSearch_12H) <- paste0("subset" , gridSearch_i - 1)

2.2.1 Result

cowplot::plot_grid(
  # visaulization of grid search result: OOB- and CV-R2
  data.frame(
    subset = gridSearch_i - 1 , 
    OOB_R2 = sapply(gridSearch_12H , function(mylist)mylist$OOB_R2 ) , 
    CV_R2 = sapply(gridSearch_12H , function(mylist)mylist$CV_R2 )
  ) %>% 
    pivot_longer(cols = c(OOB_R2 , CV_R2) , names_to = "var") %>% 
    ggplot(aes(x = subset , y = value , color = var) ) +
    geom_point() +
    geom_line() +
    scale_color_discrete(labels = c(CV_R2 = "Spatially-blocked cross validation" , OOB_R2 = "Out-of-bag")) +
    labs(x = "# variables excluded" , y = expression("R"^2) , 
         color = "Test set") +
    theme_bw() +
    theme(legend.position = "right") , 
  # visaulization of grid search result: CV-RMSE
  data.frame(
    subset = gridSearch_i - 1 , 
    CV_RMSE = sapply(gridSearch_12H , function(mylist)mylist$CV_RMSE )
  ) %>% 
    ggplot(aes(x = subset , y = CV_RMSE)) +
    geom_point() +
    geom_line() +
    labs(x = "# variables excluded" , y = "CV-RMSE") +
    theme_bw() , 
  align = "v" , axis = "lr" , ncol = 1 , rel_heights = c(6,4)
)

The models on the x axis are respectively:

  • the full model
  • full model - tp
  • full model - x - tp
  • full model - ws - x - tp
  • full model - tcc - ws - x - tp
  • full model - wd - tcc - ws - x - tp
  • only base model (CAMS-NO2, CAMS-NO, DOY, altitude, x, y)

In this case, we can see that the model with the highest CV-\(R^2\) and CV-RMSE is the base model, although the OOB-\(R^2\)s are all similar. It indicates that including the meteorological variables may not improve the model and introduce potential overfitting to the model. From this observation, we decide to drop the meteorological variables for the imputation model for OMI and only include CAMS-NO2, CAMS-NO, DOY, altitude, x, y.

Here we go back and look at 2.1.1 Result again, the model using 15H CAMS data (15H, base model) actually has higher OOB- and CV-\(R^2\). Therefore we adopt this for predictor variables of the final model.

2.3 Final model

The final model is \(OMI_{ij} \sim CAMS_{15}(NO_2)_{ij} + CAMS_{15}(NO)_{ij} + altitude_{i} + DOY{j} + x_i + y_i\), which uses the predictor variables:

  • CAMS total column NO2 at grid i on day j at 15:00UTC
  • CAMS total column NO at grid i on day j at 15:00UTC
  • altitude at grid i
  • day of year (1-365) on day j
  • centroid geographic coordinates of grid i
formula_rf_final <- OMI_NO2 ~ CAMS_NO2_15 + CAMS_NO_15 + DEM + DOY + y + x

And the hyperparameters of the random forest model are chosen to be: num.trees=1000 and mtry=5 after grid search over possible candidate value combinations. (The tedious process was not demonstrated here.)

modelInput_df <- imputation_df %>%
  select(x , y , DOY , OMI_NO2 , DEM , ends_with("_15") , spatial_CV) %>%
  # exclude rows with missing predictor/response value
  filter(!if_any(everything() , is.na)) 
rf_final <- ranger(formula_rf_final , 
                   data = modelInput_df , 
                   num.trees = 1000 , 
                   mtry = 5 , 
                   importance = "impurity" ,
                   keep.inbag = TRUE)
rf_final
## Ranger result
## 
## Call:
##  ranger(formula_rf_final, data = modelInput_df, num.trees = 1000,      mtry = 5, importance = "impurity", keep.inbag = TRUE) 
## 
## Type:                             Regression 
## Number of trees:                  1000 
## Sample size:                      29354 
## Number of independent variables:  6 
## Mtry:                             5 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       1.339053e+30 
## R squared (OOB):                  0.7601374

2.4 Evaluation of the final model

2.4.1 OOB-\(R^2\)

rf_final$r.squared
## [1] 0.7601374

2.4.2 goodness-of-fit (slope, intercept)

rf_final_pred <- rf_final$predictions # OOB-predictions
rf_final_obs_pred <- modelInput_df %>% 
  select(x,y,DOY,OMI_NO2) %>% 
  rename(obs = OMI_NO2) %>% 
  mutate(pred = rf_final_pred)
summary(lm(obs ~ pred , data = rf_final_obs_pred))
## 
## Call:
## lm(formula = obs ~ pred, data = rf_final_obs_pred)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.474e+16 -4.070e+14  2.831e+12  3.336e+14  1.988e+16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.037e+14  9.527e+12  -10.88   <2e-16 ***
## pred         1.042e+00  3.404e-03  306.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.154e+15 on 29352 degrees of freedom
## Multiple R-squared:  0.7614, Adjusted R-squared:  0.7614 
## F-statistic: 9.368e+04 on 1 and 29352 DF,  p-value: < 2.2e-16
  • slope = 1.04±0.003
  • intercept = -1.037e14 ± 9.527e12
rf_final_obs_pred %>% 
  ggplot(aes(x = pred , y = obs)) +
  geom_abline(slope = 1 , intercept = 0 , linetype = 2 , color = "azure4") +
  geom_point(shape = 1 , alpha = 0.5) +
  geom_smooth(method = "lm") +
  coord_fixed(1) +
  labs(x = expression("Modeled OMI-NO"[2]) , y = expression("Observed OMI-NO"[2])) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

2.4.3 cross validation \(R^2\)

rf_final_CV_R2_k <- c()
for(k in 1:k_fold){
  # model
  rf_temp_cv <- ranger(formula_rf_final ,
                       data = modelInput_df %>% 
                         filter(spatial_CV != as.character(k)) ,
                       importance = "impurity" ,
                       keep.inbag = TRUE , 
                       num.trees = rf_final$num.trees , 
                       mtry = rf_final$mtry)
  # prediction on test set
  rf_temp_cv_pred <- predict(rf_temp_cv , 
                             data = modelInput_df %>% 
                               filter(spatial_CV == as.character(k)) )
  rf_temp_cv_obs_pred <- modelInput_df %>% 
    filter(spatial_CV == as.character(k)) %>% 
    select(OMI_NO2) %>% 
    rename(obs = OMI_NO2) %>% 
    mutate(pred = rf_temp_cv_pred$predictions)
  # calculate CV-R2
  rf_final_CV_R2_k[k] <- cor(rf_temp_cv_obs_pred$obs , rf_temp_cv_obs_pred$pred)^2
  # clean
  rm(rf_temp_cv , rf_temp_cv_pred , rf_temp_cv_obs_pred)
}
# mean CV-R2
mean(rf_final_CV_R2_k)
## [1] 0.5949998
range(rf_final_CV_R2_k)
## [1] 0.4836815 0.7082183

10-fold-CV-\(R^2\): 0.595 (0.484—0.708)


3 Implementation

Use the full datasets (all pixel-days) and predict OMI_NO2 with the final model:

projected_rf <- predict(rf_final , 
                        data = imputation_df %>% 
                          filter(!is.na(DEM)) , 
                        type = "se")

Prepare the projected data as data frame (along with the predictor variables):

projected <- imputation_df %>% 
  filter(!is.na(DEM)) %>% 
  select(x,y,DOY,OMI_NO2) %>% 
  # add the projected (model-predicted) OMI-NO2
  mutate(OMI_NO2_pred = projected_rf$predictions , 
         OMI_NO2_pred_se = projected_rf$se) %>% 
  # set negative values to 0
  mutate(OMI_NO2_pred = ifelse(OMI_NO2_pred < 0 , 0 , OMI_NO2_pred))

Convert the data frame to spatialtemporal array:

projected_stars <- projected %>% 
  # convert DOY to date
  mutate(DOY = as_date(DOY , origin = "2018-12-31")) %>% 
  # convert to stars
  st_as_stars(dims = c("x" , "y" , "DOY")) %>% 
  # rename dimension DOY to date
  st_set_dimensions(3 , names = "date") %>% 
  # set coordinate system
  st_set_crs(st_crs(OMI_raw)) %>% 
  # de-select the attribute "DOY"
  select(-DOY) %>% 
  # imputed: observed + predicted
  mutate(OMI_NO2_imputed = ifelse(is.na(OMI_NO2) , OMI_NO2_pred , OMI_NO2))

Here we can look at the final projection result:

projected_stars
## stars object with 3 dimensions and 4 attributes
## attribute(s):
##     OMI_NO2           OMI_NO2_pred        OMI_NO2_pred_se    
##  Min.   :-9.982e+14   Min.   :0.000e+00   Min.   :1.235e+12  
##  1st Qu.: 4.428e+14   1st Qu.:7.877e+14   1st Qu.:2.010e+14  
##  Median : 1.363e+15   Median :1.759e+15   Median :3.978e+14  
##  Mean   : 1.959e+15   Mean   :2.187e+15   Mean   :5.428e+14  
##  3rd Qu.: 2.714e+15   3rd Qu.:3.011e+15   3rd Qu.:7.151e+14  
##  Max.   : 5.517e+16   Max.   :4.384e+16   Max.   :1.200e+16  
##  NA's   :67006        NA's   :25915       NA's   :49026      
##  OMI_NO2_imputed     
##  Min.   :-9.982e+14  
##  1st Qu.: 7.309e+14  
##  Median : 1.734e+15  
##  Mean   : 2.181e+15  
##  3rd Qu.: 3.033e+15  
##  Max.   : 5.517e+16  
##  NA's   :25915       
## dimension(s):
##      from  to     offset  delta                       refsys point values x/y
## x       1  22       5.75   0.25 GEOGCS["unnamed ellipse",...    NA   NULL [x]
## y       1  12      48.25  -0.25 GEOGCS["unnamed ellipse",...    NA   NULL [y]
## date    1 365 2019-01-01 1 days                         Date    NA   NULL

projected_stars is a 3D array (x,y,date) with 4 attributes. OMI_NO2 is the raw OMI pixels, OMI_NO2_pred is the model-predicted values, OMI_NO2_pred_se is the prediction standard error reported by ranger (the random forest model), and OMI_NO2_imputed is the final imputed product. Pixels that exist in the original OMI observation stay the same in OMI_NO2_imputed, whereas missing pixels are filled in with the model-predicted values (OMI_NO2_pred).

We can visualize the projection result (of some random dates):

selected_dates <- interval("2019-02-01" , "2019-02-10")
ggplot() +
  geom_stars(
    data = projected_stars %>% 
      filter(date %within% selected_dates) %>% 
      merge() %>% 
      st_set_dimensions(4 , values = c("observed" , "predicted" , "se" , "imputed")) 
  ) +
  geom_sf(data = CH , fill = NA , color = "white") +
  scale_fill_viridis_c(limits = c(0,1e16), oob = scales::squish) +
  coord_sf(expand = FALSE) +
  facet_grid(attributes~date) +
  labs(x = "Longtitude" , y = "Latitude" , fill = "TropColumnNO2") +
  theme(axis.text = element_text(size = 6))

As an alternative for the visualization, an animation is used:

library(gganimate)
annimation_obs_pred <- ggplot() +
  geom_stars(
    data = projected_stars %>%
      select(OMI_NO2 , OMI_NO2_pred , OMI_NO2_imputed) %>%
      merge() %>%
      st_set_dimensions(4 , values = c("observed" , "predicted" , "imputed"))
  ) +
  geom_sf(data = CH , fill = NA , color = "white") +
  scale_fill_viridis_c(limits = c(0,1e16), oob = scales::squish) +
  coord_sf(expand = FALSE) +
  facet_grid(~attributes) +
  transition_time(date) +
  labs(x = "Longtitude" , y = "Latitude" , fill = "TropColumnNO2" ,
       title = "Date: {frame_time}") +
  theme(axis.text = element_text(size = 6))
animate(annimation_obs_pred , duration = 365 , fps = 1)

Note that some pixels are “not visualized” due to the limits setting of the color symbology, not missing!

Export the projection result:

if(!dir.exists("1_data/processed/OMI_imputed")) dir.create("1_data/processed/OMI_imputed")
projected_stars %>% 
  select(OMI_NO2_imputed) %>% 
  write_stars(dsn = "1_data/processed/OMI_imputed/OMI-Aura_L3-OMI_MINDS_NO2d_2019_daily_imputed_AOI.nc")